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Abstract 

In order to compute the log-likelihood for high dimensional spatial Gaussian models, it is 
necessary to compute the determinant of the large, sparse, symmetric positive definite precision 
matrix, Q. Traditional methods for evaluating the log- likelihood for very large models may fail 
due to the massive memory requirements. We present a novel approach for evaluating such 
likelihoods when the matrix-vector product, Qv, is fast to compute. In this approach we utilise 
matrix functions, Krylov sub-spaces, and probing vectors to construct an iterative method for 
computing the log-likelihood. 



1 Introduction 

In computational and, in particular, spatial statistics, increasing possibilities for observing large 
amounts of data leaves the statistician in want of computational techniques capable of extracting 
useful information from s uch data. Large da t a set s arise in many applications, such as modelling 



temperature and cloud formations ([McPeters et al 



seismic data acquisition (IBuland and Omrd (12003 ) ); ana lysing satellite data for ozone intensity. 



(|l996l )): or constructing global climate models 



(jLindgren et alJ l|201ll )). Vlost models in spatial statistics are based around multivariate Gaussian 
distributions, which has probability density function 



p(x) = (2vr)- fc / 2 det(Q) i/ "exp 



V2, 



1 



x — fj,) T Q(x — fi 



where the precision matrix Q is the inverse of the covariance matrix. In this paper, we assume that 
the precision matrix is sparse, which essentially enforces a Markov property on the Gaussian random 
field. These models have better computational properties than those base d on th e covariance, and 
there are modelling r eason s to prefer using Q directly (jLindgren et alJ (|201ll )). We note that 
Rue and Tielmelandl (j2002l ) showed that it is possible to approximate general Gaussian random 
fields by Gaussian Markov random fields (GMRFs), that is Gaussian random vectors with sparse 
precision matrices. 

Throughout this paper, we will consider the common Gauss-linear model, in which our data is 
a noisy observation of a linear transformation of a true random field, that is 



y = A{6)x + ei 



(1) 



where the matrix A(0) models the observation of the true underlying field x, known as the 'forward 
model', while e ~ NfojQ^ 1 ) is the observation noise. In order to complete the model, we require 
a prior distribution on x, which we take to be Gaussian with mean fi and precision matrix Q x (r]), 
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and appropriate hyperpriors are given for the mean fj,, the hyperparameters in the prior rj and the 
hyperparameters in the forward model 0. 

Given a set of data, we wish to infer rj, 6 and x. The ways in which this is done can vary, but in 
the end, several determinant evaluations are needed. One way to do this is to alternately estimate 
x and rj, 6 using the distributions p(x\y, r], 6) and p(rj, 0\x, y), updating each consecutively. That 
is 

1. Find argmaXj, p(x\y, rj, 0) 

2. Find argmax TJ p(r/, 6\x, y) 

3. Repeat until convergence. 

In a Gauss-linear model, the first step involves a linear solve, while the second is an optimisation 
over the space in which r/, reside. The distribution is 

p(v> d \ x i v) « v{y\x, v, o)p(v, °\ x ) 

= P(y\x, e)p(x\rj)p(r])p(6) (2) 

The log of the last line in ([2]) gives the objective function for the hyper-parameters in this case: 

0) = - logp{y\x, e)p(x\rj)p(rj)p{e) (3) 

We also have that p(y\x, 6)p{x\r}) as a function of x is Af(fi p , Q p ), where Q p = Q x (rj)+A(0)' 1 QiA(O) 
and \i v = Qp 1 (Q x fi + A T Q\y). The separated expression, p(y\x, 0)p(x\r]) in <3>, is usually prefer- 
able, but this is problem dependent. 
Expanding ([3]), we get 

$(77, 6) =C - log(i det(Qi)) + hiv - A{6)x) T Q 1 {y - A(0)x)- 

log(- det(Q,(T7))) + (x- v) T Q x (x - p) - logpfa) " ^gp(6). (4) 

In this expression, the only term which is difficult to evaluate is log det(Q x (r/)), and it is needed 
for estimating the hyper-parameters, both for point estimates an d for Gaussian- or Laplace- 
approximation, of the hyper-parameters fsee lOariin and Louis' M for details on such approxi- 



mations). It is this evaluation and its use in optimisation we address in this article. 

2 Determinant evaluations 

The most common way to compute the log-determinant of a sparse precision or covariance matrix 
is to 1) reorder Q to optimise for Cholesky factorisation, 2) perform a Cholesky factorisation of 
the reordered matrix Q = LL T ', 3) extract the diagonal entries of I = diag(Z) and 4) set the log- 
determinant as logdetQ = Y^=i 2 log(^) (this comes from the identity detQ = detLdetL T = 
(detL) 2 ). The algorithm takes very few lines to pr ogram, given a good sparse matrix sorting 



routine, such as METIS (jKarypis and Kumarl (I199S ) ) and a fast sparse Cho lesky factoring im- 
plementation, such as CHOLMOD (jDavis and Hager (|1999MChen et al.1 (|2008l )). Problems occur, 



however, when there are massive amounts of fill-in in the Cholesky factorisation even after resorting 
the matrix in question. For a Gaussian Markov random field the dimensionality of the underlying 
parameter space affect the storage c ost for comput i ng th e Cholesky factorisation. In M 1 the cost is 
0(n), in M 2 , C(n 3 / 2 ) in M 3 , 0(n 2 ) (iRue and Heidi (|200Sh ). 
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2.1 Alternative approximations 

The starting point for an alternative, less memory intensive procedure for computing the log- 
determinant comes from the identity 



log det Q = tr log Q 



(5) 



In the symmetric positive definite case, this identity is proved noting that det Q = Y\2=i wnere 
{Xi} are the eigenvalues of Q and that log Q = V log(A)V T with A = diag(A) and V contains the 
eigenvectors of Q. Furthermore, tr (V log(A)V r ) = tr(W T logA) = tr log A, which gives the 
identity. 

How can this be useful in computation, is the next question. A trivial observation shows that 

n 

tr log Q = ^ e J lo s(Q) e j 



where ej = (0, . . . , 1, . . . , 0) where the one is in position j. While it is cumbersome to carry out this 
computation, it is the basis for stochastic estimators of the log-determinant. Such estimators have 
been studied for the trace of a matrix, the trace of the inver se of a matrix and our c ase, the trace of 
the lo garithm of a matrix. Details on this can be found in iHutchinsonl (jl98flh and iBai and Golubl 
(Il997h . The Hutchinson stochastic estimator is described as follows: 1) Let Vj, j = l,...,s be 
vectors with entries P(v k = 1) = 1/2, P(v k = —1) = 1/2 independently. 2) Let 



1 s 

tr \ogQ^-Y,vjlog(Q)vj. 



(6) 



As this is a Monte Carlo method, it is possible to compute confide nce regions for the est imators, 
using either Monte Carlo techniques or the Hoeffding inequality (|Bai and Golubl (|19971 )). The 
Hutchinson estimator was formulated for approximating trQ -1 in which case, we replace the logQ 
in ([6]) with Q- 1 . 

The Hutchinson estimator requires a lot of random vectors Vj to be sufficiently accurate for 
optimisation. The memory requirements are low, but we may have to wait an eternity for one 
determinant approximation. The question, then, can we choose the VjS in a clever way, so that we 
require far fewer vectors? 

In recent publications, Bekas et al. ( 2007 ) and Tang and Saad ( 20ld ) explored the use of probing 
vectors for extracting the diagonal of a matrix or its inverse. In the first of these the diagonal of a 
sparse matrix is extracted, and it is relatively straightforward under mild assumptions to extract 
the diagonal. The second relies on approximate sparsity of the inverse, where the approximate 
sparsity pattern of Q 1 is determined by a power of the original matrix, i.e. Q p . Assuming such a 
sparsity structure, it is possible to c ompute the probin g vectors {vj}j =1 by a neighbour colouring 
of the graph induced by Q p (see e.g. Culberson ( 1992!) for a survey on the greedy graph colouring 
algorithms). A heuristic suggested in Tang and Saad ( 2010l ) to find the power, p in Q p is to solve 



Qx = ej and find p = min{d(l, j)\\xi\ < e} where d gives the graph distance. In our case, we 
may compute log(Q)ej and use the same heuristic. A nice feature is that the probing vectors need 
not be stored, but may be computed cheaply on the fly. If we pre-compute them, they are sparse, 
and does not need much storage. Since what we need for each probing vector is vj log(Q)vj, we 
observe that the computation is highly parallel with low communication costs. Each node gets 
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one probing vector, and computes vj log(Q)vj and sends back the result. In essence, this leads to 
linear speedup with the amount of processors available with proportionality close to unity. 

Next, we need to consider the evaluation of \og(Q)vj. The matrices we consider have real 
positive spectrum, and it is possible to evaluate log(Q)vj through Cauchy's integral formula, 



log(Q)vj = <j> \og(x)(zI - Q) 1 vjdz, 



where T is a suitable curve enclosing the spectrum of Q and avoiding branch cuts of the logarithm. 
Direct quadratur e over such curves can be terribly inefficient, but through clever conformal map- 
pings, lHale et al. (120081 ) developed midpoint quadrature rules that converge rapidly for increasing 
number of quadrature points at the cost of needing estimates for the extremal eigenvalues of Q. In 
fact, || log Q - /jv(Q)|| = C(e- 27rAr /( log ( Am ^/ A ™") +6 )) with f N as below. This essentially gives us 
the rational approximation 



v 



\og(Q)vj ps f N (Q)vj = ^2 a i(Q ~ a i T ) lv j> a h o\ G C. 



(7) 



l=i 



In effect, we need to solve a family of shifted linear systems to approximate \og(Q)vj. How we 
compute fw(Q)vj is problem dependent, but in high dimensions, we usually have to rely on iter- 
ative methods, such as Krylov methods. A Krylov subspace, Kk(Q,v) is defined by fCk(Q,v) = 
span{i>, Qv, Q 2 v , . . . , Q k v} and a thorough introduction to the use and theory of Krylov methods 
can be found in lSaadl ( 20031 ). Which Krylov method we use is highly dependent on the quality and 



performance potential preconditioners for the matrix Q. 

While the Krylov method of choice is problem dependent, there are ones that are particu- 
larily well suited to compute the approximation in (J7|). These methods are based on the fact 
that K,f.(Q,v) = !Ck{Q — ail,v) and after some simple algebra, we ob tain the coe f ficient s for 
the shifted syst ems without computing new matrix- vector products, see Jegerlehnerl (19961 ) and 
Frommer (j2003l ) for details. We have employed the method CG-M in I Jegerlehnerl ( 1996 ) our com- 
putations. One possible difficulty in employing the method is that we have complex shifts - this 
is remedied by using a variant, Conjugate Orthogonal CG-M (COCG-M), which entails using the 
conjugate symmetric f orm (x, y) = x T y instead of the usual inner product (x,y) = x T y in the 
Krylov iterations. See van der Vorst and Melissen (1990) for a description of the COCG method. 
In practice, little complex arithmetic is needed, since the complex, shifted coefficients are computed 
from the real ones obtained by the CG method used to solve Qx = y. 



3 Examples 

In this section we explore how optimisation fares under different conditions. For doing this, we 
assume essentially the simplest possible model, but we plan use the outlined approach on a seismic 
case in the future. The model we assume is the SPDE t(k — A)u = W, in 2D, which we observe 
directly. We will explore how optimisation works (on r, k) for different ks and different distance 
colouring of the graph. Note that the number of colours needed is essentially independent of the 
granularity of the discretisation: a fine grid yields approximately the same number of colours 
as a coarse grid. The initial suspicion is that for small ks, corresponding to long range will be 
harder to optimise in the following sense: we need more Krylov iterations for the COCG-M routine 
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to converge and we need a larger distance colouring to cover the increasing range, resulting in 
more probing vectors. We use a modified Newton method for optimisation and compare using the 
exact determinant to using the approximation outlined above. For the COCG-M method, we use 
a relative tolerance of 10~ 3 for computing log(Q)vj. Note that a prior on the parameters will 
stabilise the results as usual. The results are given in Table [TJ 



Table 1: Optimisation of (k, t) for different distance colourings 





Exact 


2-dist 


4-dist 


6-dist 


8-dist 


10-dist 


K = 1 


(0.927,1.015) 


(1.06,0.98) 


(0.933,1.013) 


(0.927,1.015) 






K = 0.5 


(0.455,1.010) 


(0.605,0.961) 


(0.471,1.005) 


(0.457,1.009) 


(0.455,1.010) 




K = 0.1 


(0.0842,1.003) 


(0.208,0.940) 


(0.122,0.984) 


(0.0983,0.996) 


(0.0891,1.000) 


(0.0859,1.002) 


k = 0.05 


(0.0398,1.001) 


(0.138,0.941) 


(0.0762,0.980) 


(0.0567,0.992) 


(0.0475,0.997) 


(0.0434,0.999) 


K = 0.01 


(0.00644,0.998) 


(0.0565,0.947) 


(0.0292,0.978) 


(0.197,0.987) 


(0.0143,0.992) 


(0.0117,0.994) 



In Table [TJ . . . indicates that the optimisation yields the same as the previous entry. We see 
that as we increase the graph neighbourhood in our colourings, we get results closer and closer to 
that of using the exact determinant. We also see that the estimates are monotone: k decreases 
with larger distance colourings and r increases. Lastly, we see that some of the estimates are better 
when using the approximation. This should not necessarily be taken as a good sign, as it may only 
be a result of approximation errors. 

As increasing k in fc-distance colourings yields better and better approximations, one could be 
lead to using a "large" k in the entire optimisation, whatever method one uses to determine k. 
Our results indicate, however, that we only need very good approximations in the last iterations of 
the optimisation procedure. In effect, we may use 2-distance colourings in the beginning, and go 
to 5- or larger -distance colourings in the last steps. Computing the colourings is cheap and one 
colouring only requires the storage of one vector, so we may store a couple of different colourings 
and use them as required in the optimisation procedure. 

Lastly, we present an example that cannot be done using black-box Cholesky factorisations. 
Namely, a 3-D version of the model above with k = 0.5 and 2 million discretisation points. We 
use a 1-distance colouring for the first iterations and increase to 2- and 6-distance colouring in 
the last iterations. This will give us temporary /c-distance estimates which we also give. The 
estimates, progressively from 1-, 2 and 6-distance colouring were (7.627,0.382), (1.401,0.801), 
and (0.561,0.988). It took 24 hours to complete the optimisation. From this we conclude - the 
method is slow, but it can be used for parameter estimation in high-dimensional problems where 
other alternatives are impossible due to memory limitations. We must, however, be careful so that 
we have enough colours to capture the essentials of the determinant. The estimated memory use 
for using Cholesky factorisation in the determinant evaluations is 155 Gb with METIS sorting of 
the graph and much higher without. Few computing servers have this amount of memory on a 
node. The memory consumption for the approximation was 3 Gb at maximum, and even this may 
be lowered quite a lot with some clever memory management. In a computing cluster, the time 
for computing this optimisation would have been much lower due to linear speedup vs. number of 
nodes. 

We have tried the approximation for different types of matrices, and what we found is that the 
examples above are among the toughest to do determinant approximations on. Since we can do 
reasonable approximations for this class, we expect that these likelihood evaluations will work for 
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a large class of precision matrices in use in statistics. 



4 Discussion 

We have showed that the determinant approximations discussed shows promise for likelihood eval- 
uations in models where we cannot perform Cholesky factorisations or Kronecker decompositions 
of the precision matrices. This may prove useful in high dimensional models where approximate 
likelihoods are not sufficient for accurate inference. It remains to utilise the approximations on a 
real world dataset. 
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